==========================================================
## [1] "ï..OBJECTID" "Parcel"
## [3] "XRefParcel" "Address"
## [5] "DateParcelChanged" "PropertyClass"
## [7] "PropertyUse" "AssessmentArea"
## [9] "AreaName" "MoreThanOneBuild"
## [11] "HomeStyle" "DwellingUnits"
## [13] "Stories" "YearBuilt"
## [15] "Bedrooms" "FullBaths"
## [17] "HalfBaths" "TotalLivingArea"
## [19] "FirstFloor" "SecondFloor"
## [21] "ThirdFloor" "AboveThirdFloor"
## [23] "FinishedAttic" "Basement"
## [25] "FinishedBasement" "ExteriorWall1"
## [27] "ExteriorWall2" "Fireplaces"
## [29] "CentralAir" "PartialAssessed"
## [31] "AssessedByState" "CurrentLand"
## [33] "CurrentImpr" "CurrentTotal"
## [35] "PreviousLand" "PreviousImpr"
## [37] "PreviousTotal" "NetTaxes"
## [39] "SpecialAssmnt" "OtherCharges"
## [41] "TotalTaxes" "LotSize"
## [43] "Zoning1" "Zoning2"
## [45] "Zoning3" "Zoning4"
## [47] "FrontageFeet" "FrontageStreet"
## [49] "WaterFrontage" "TIFDistrict"
## [51] "TaxSchoolDist" "AttendanceSchool"
## [53] "ElementarySchool" "MiddleSchool"
## [55] "HighSchool" "Ward"
## [57] "StateAssemblyDistrict" "RefuseDistrict"
## [59] "RefuseURL" "PreviousLand2"
## [61] "PreviousImpr2" "PreviousTotal2"
## [63] "AlderDistrict" "AssessmentChangeDate"
## [65] "BlockNumber" "BuildingDistrict"
## [67] "CapitolFireDistrict" "CensusTract"
## [69] "ConditionalUse" "CouncilHold"
## [71] "DateAdded" "DeedPage"
## [73] "DeedRestriction" "DeedVolume"
## [75] "ElectricalDistrict" "EnvHealthDistrict"
## [77] "ExemptionType" "FireDistrict"
## [79] "FloodPlain" "FuelStorageProximity"
## [81] "HeatingDistrict" "Holds"
## [83] "IllegalLandDivision" "LandfillProximity"
## [85] "LandfillRemediation" "Landmark"
## [87] "LandscapeBuffer" "LocalHistoricalDist"
## [89] "LotDepth" "LotNumber"
## [91] "LotteryCredit" "LotType1"
## [93] "LotType2" "LotWidth"
## [95] "MCDCode" "NationalHistoricalDist"
## [97] "NeighborhoodDesc" "NeighborhoodPrimary"
## [99] "NeighborhoodSub" "NeighborhoodVuln"
## [101] "NoiseAirport" "NoiseRailroad"
## [103] "NoiseStreet" "ObsoleteDate"
## [105] "OwnerChangeDate" "OwnerOccupied"
## [107] "ParcelChangeDate" "ParcelCode"
## [109] "ParkProximity" "Pending"
## [111] "PlanningDistrict" "PlumbingDistrict"
## [113] "PoliceDistrict" "PoliceSector"
## [115] "PreviousClass" "PropertyUseCode"
## [117] "RailroadFrontage" "ReasonChangeImpr"
## [119] "ReasonChangeLand" "SenateDistrict"
## [121] "SupervisorDistrict" "TifImpr"
## [123] "TifLand" "TifYear"
## [125] "TotalDwellingUnits" "TrafficAnalysisZone"
## [127] "TypeWaterFrontage" "UWPolice"
## [129] "WetlandInfo" "ZoningAll"
## [131] "ZoningBoardAppeal" "UrbanDesignDistrict"
## [133] "HouseNbr" "StreetDir"
## [135] "StreetName" "StreetType"
## [137] "Unit" "StreetID"
## [139] "StormOutfall" "FireDemandZone"
## [141] "FireDemandSubZone" "PropertyChangeDate"
## [143] "MaxConstructionYear" "XCoord"
## [145] "YCoord" "SHAPESTArea"
## [147] "SHAPESTLength"
## ï..OBJECTID Parcel XRefParcel
## "integer" "numeric" "numeric"
## Address DateParcelChanged PropertyClass
## "factor" "factor" "factor"
## PropertyUse AssessmentArea AreaName
## "factor" "integer" "factor"
## MoreThanOneBuild HomeStyle DwellingUnits
## "factor" "logical" "integer"
## Stories YearBuilt Bedrooms
## "numeric" "integer" "integer"
## FullBaths HalfBaths TotalLivingArea
## "integer" "integer" "integer"
## FirstFloor SecondFloor ThirdFloor
## "integer" "integer" "integer"
## AboveThirdFloor FinishedAttic Basement
## "integer" "integer" "integer"
## FinishedBasement ExteriorWall1 ExteriorWall2
## "integer" "factor" "factor"
## Fireplaces CentralAir PartialAssessed
## "integer" "factor" "logical"
## AssessedByState CurrentLand CurrentImpr
## "factor" "integer" "integer"
## CurrentTotal PreviousLand PreviousImpr
## "integer" "integer" "integer"
## PreviousTotal NetTaxes SpecialAssmnt
## "integer" "numeric" "numeric"
## OtherCharges TotalTaxes LotSize
## "numeric" "numeric" "numeric"
## Zoning1 Zoning2 Zoning3
## "factor" "factor" "factor"
## Zoning4 FrontageFeet FrontageStreet
## "factor" "numeric" "factor"
## WaterFrontage TIFDistrict TaxSchoolDist
## "factor" "integer" "logical"
## AttendanceSchool ElementarySchool MiddleSchool
## "factor" "factor" "factor"
## HighSchool Ward StateAssemblyDistrict
## "factor" "integer" "integer"
## RefuseDistrict RefuseURL PreviousLand2
## "factor" "factor" "integer"
## PreviousImpr2 PreviousTotal2 AlderDistrict
## "integer" "integer" "integer"
## AssessmentChangeDate BlockNumber BuildingDistrict
## "factor" "integer" "integer"
## CapitolFireDistrict CensusTract ConditionalUse
## "factor" "numeric" "integer"
## CouncilHold DateAdded DeedPage
## "integer" "factor" "integer"
## DeedRestriction DeedVolume ElectricalDistrict
## "integer" "integer" "integer"
## EnvHealthDistrict ExemptionType FireDistrict
## "integer" "factor" "integer"
## FloodPlain FuelStorageProximity HeatingDistrict
## "integer" "integer" "integer"
## Holds IllegalLandDivision LandfillProximity
## "factor" "integer" "integer"
## LandfillRemediation Landmark LandscapeBuffer
## "factor" "factor" "integer"
## LocalHistoricalDist LotDepth LotNumber
## "factor" "numeric" "integer"
## LotteryCredit LotType1 LotType2
## "integer" "factor" "factor"
## LotWidth MCDCode NationalHistoricalDist
## "numeric" "factor" "integer"
## NeighborhoodDesc NeighborhoodPrimary NeighborhoodSub
## "factor" "factor" "factor"
## NeighborhoodVuln NoiseAirport NoiseRailroad
## "factor" "integer" "integer"
## NoiseStreet ObsoleteDate OwnerChangeDate
## "integer" "logical" "factor"
## OwnerOccupied ParcelChangeDate ParcelCode
## "factor" "factor" "factor"
## ParkProximity Pending PlanningDistrict
## "integer" "logical" "factor"
## PlumbingDistrict PoliceDistrict PoliceSector
## "integer" "factor" "integer"
## PreviousClass PropertyUseCode RailroadFrontage
## "factor" "integer" "factor"
## ReasonChangeImpr ReasonChangeLand SenateDistrict
## "logical" "factor" "integer"
## SupervisorDistrict TifImpr TifLand
## "integer" "integer" "integer"
## TifYear TotalDwellingUnits TrafficAnalysisZone
## "integer" "integer" "integer"
## TypeWaterFrontage UWPolice WetlandInfo
## "factor" "factor" "factor"
## ZoningAll ZoningBoardAppeal UrbanDesignDistrict
## "factor" "integer" "factor"
## HouseNbr StreetDir StreetName
## "integer" "factor" "factor"
## StreetType Unit StreetID
## "factor" "factor" "integer"
## StormOutfall FireDemandZone FireDemandSubZone
## "factor" "integer" "integer"
## PropertyChangeDate MaxConstructionYear XCoord
## "factor" "integer" "numeric"
## YCoord SHAPESTArea SHAPESTLength
## "numeric" "numeric" "numeric"
## 'data.frame': 79022 obs. of 147 variables:
## $ ï..OBJECTID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Parcel : num 6.08e+10 6.08e+10 6.08e+10 6.08e+10 6.08e+10 ...
## $ XRefParcel : num 6.08e+10 6.08e+10 6.08e+10 6.08e+10 6.08e+10 ...
## $ Address : Factor w/ 79021 levels "1 Abilene Ct",..: 19267 19373 19465 19704 19796 20005 20119 20199 59274 59239 ...
## $ DateParcelChanged : Factor w/ 886 levels "1993-04-28T00:00:00.000Z",..: 690 690 690 690 690 690 849 690 713 690 ...
## $ PropertyClass : Factor w/ 4 levels "Agricultural",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ PropertyUse : Factor w/ 305 levels "","0 unit Apartment",..: 273 273 273 273 273 273 273 273 273 273 ...
## $ AssessmentArea : int 1 1 1 1 1 1 1 1 1 1 ...
## $ AreaName : Factor w/ 434 levels "2 units in Area 115",..: 303 303 303 303 303 303 303 303 303 303 ...
## $ MoreThanOneBuild : Factor w/ 2 levels "","Has more than one building": 1 1 1 1 1 1 1 1 1 1 ...
## $ HomeStyle : logi NA NA NA NA NA NA ...
## $ DwellingUnits : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Stories : num 1 1 1 1 1 1 1 1 1 1 ...
## $ YearBuilt : int 1960 1959 1959 1962 1959 1962 1964 1965 1958 1959 ...
## $ Bedrooms : int 3 4 3 3 3 5 5 4 3 4 ...
## $ FullBaths : int 1 1 1 2 2 2 2 2 1 2 ...
## $ HalfBaths : int 2 1 1 1 0 0 0 0 1 0 ...
## $ TotalLivingArea : int 1371 1488 1290 1043 1386 1008 990 1076 1208 1603 ...
## $ FirstFloor : int 1371 1488 1290 1043 1386 1008 990 1076 1208 1603 ...
## $ SecondFloor : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ThirdFloor : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AboveThirdFloor : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FinishedAttic : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Basement : int 1254 1008 1066 1008 1386 1008 935 1040 1208 1120 ...
## $ FinishedBasement : int 686 350 564 607 550 750 637 860 380 560 ...
## $ ExteriorWall1 : Factor w/ 9 levels "","Aluminum/Vinyl",..: 9 9 2 9 2 9 9 2 2 9 ...
## $ ExteriorWall2 : Factor w/ 9 levels "","Aluminum/Vinyl",..: 1 1 1 1 1 1 3 1 1 1 ...
## $ Fireplaces : int 1 1 0 0 1 1 1 0 0 1 ...
## $ CentralAir : Factor w/ 3 levels "","NO","YES": 3 3 3 3 3 3 3 3 3 2 ...
## $ PartialAssessed : logi NA NA NA NA NA NA ...
## $ AssessedByState : Factor w/ 2 levels "","ASSESSED BY STATE": 1 1 1 1 1 1 1 1 1 1 ...
## $ CurrentLand : int 61700 66000 69700 67000 62700 57000 65100 58200 65200 62200 ...
## $ CurrentImpr : int 125000 114700 108700 126900 129500 112100 114000 137300 113900 125400 ...
## $ CurrentTotal : int 186700 180700 178400 193900 192200 169100 179100 195500 179100 187600 ...
## $ PreviousLand : int 61700 66000 69700 67000 62700 57000 65100 58200 65200 62200 ...
## $ PreviousImpr : int 116100 106100 97000 119400 120300 101000 105500 128000 102200 116500 ...
## $ PreviousTotal : int 177800 172100 166700 186400 183000 158000 170600 186200 167400 178700 ...
## $ NetTaxes : num 4138 3998 3945 4306 4267 ...
## $ SpecialAssmnt : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherCharges : num 0 0 568 0 0 ...
## $ TotalTaxes : num 4138 3998 4513 4306 4267 ...
## $ LotSize : num 14270 14718 18867 14984 13334 ...
## $ Zoning1 : Factor w/ 44 levels "A","AP","CC",..: 26 26 26 26 26 26 26 26 26 26 ...
## $ Zoning2 : Factor w/ 53 levels "","CN","HIS-FS",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Zoning3 : Factor w/ 24 levels "","HIS-L","HIS-MH",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Zoning4 : Factor w/ 4 levels "","W","WP-17",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ FrontageFeet : num 85.1 65.8 66 64.5 81.7 ...
## $ FrontageStreet : Factor w/ 2808 levels "","Aaron Ct",..: 1926 1926 1926 1926 1926 1926 1926 1926 1991 1991 ...
## $ WaterFrontage : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
## $ TIFDistrict : int 0 0 0 0 0 0 0 0 0 0 ...
## $ TaxSchoolDist : logi NA NA NA NA NA NA ...
## $ AttendanceSchool : Factor w/ 9 levels "","De Forest",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ ElementarySchool : Factor w/ 31 levels "","Allis","Assigned",..: 13 13 13 13 13 13 13 13 13 13 ...
## $ MiddleSchool : Factor w/ 14 levels "","Assigned",..: 13 13 13 13 13 13 13 13 13 13 ...
## $ HighSchool : Factor w/ 6 levels "","East","Lafollette",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Ward : int 95 95 95 95 95 95 95 95 95 95 ...
## $ StateAssemblyDistrict : int 78 78 78 78 78 78 78 78 78 78 ...
## $ RefuseDistrict : Factor w/ 22 levels "00","01A","01B",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ RefuseURL : Factor w/ 12 levels "","http://www.cityofmadison.com/streets/documents/friA.pdf",..: 10 10 10 10 10 10 10 10 10 10 ...
## $ PreviousLand2 : int 61700 66000 69700 67000 62700 57000 65100 58200 65200 62200 ...
## $ PreviousImpr2 : int 112600 102700 93700 115700 116700 77600 102200 124300 98900 113000 ...
## $ PreviousTotal2 : int 174300 168700 163400 182700 179400 134600 167300 182500 164100 175200 ...
## $ AlderDistrict : int 20 20 20 20 20 20 20 20 20 20 ...
## $ AssessmentChangeDate : Factor w/ 714 levels "","1981-01-04T00:00:00.000Z",..: 632 632 632 632 632 632 632 632 632 632 ...
## $ BlockNumber : int 0 0 0 0 0 0 0 0 0 0 ...
## $ BuildingDistrict : int 2 2 2 2 2 2 2 2 2 2 ...
## $ CapitolFireDistrict : Factor w/ 2 levels " - ","1 - Downtown Fire Safety District": 1 1 1 1 1 1 1 1 1 1 ...
## $ CensusTract : num 5.02 5.02 5.02 5.02 5.02 5.02 5.02 5.02 5.02 5.02 ...
## $ ConditionalUse : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CouncilHold : int 0 0 0 0 0 0 0 0 0 0 ...
## $ DateAdded : Factor w/ 1305 levels "","1989-02-14T00:00:00.000Z",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ DeedPage : int 0 0 0 0 0 0 0 0 0 0 ...
## $ DeedRestriction : int 0 0 0 0 0 0 0 0 0 0 ...
## $ DeedVolume : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ElectricalDistrict : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EnvHealthDistrict : int 31 31 31 31 31 31 31 31 31 31 ...
## $ ExemptionType : Factor w/ 46 levels " - ","1 - State Property",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ FireDistrict : int 1 1 1 1 1 1 1 1 1 1 ...
## $ FloodPlain : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FuelStorageProximity : int 0 0 0 0 0 0 0 0 0 0 ...
## $ HeatingDistrict : int 2 2 2 2 2 2 2 2 2 2 ...
## $ Holds : Factor w/ 4526 levels "","HOLD: PLAJM @ 65 Buttonwood Ct",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ IllegalLandDivision : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LandfillProximity : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LandfillRemediation : Factor w/ 4 levels "","IN PROXIMITY TO KNOWN LANDFILL",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Landmark : Factor w/ 3 levels "","A","L": 1 1 1 1 1 1 1 1 1 1 ...
## $ LandscapeBuffer : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LocalHistoricalDist : Factor w/ 6 levels " - ","1 - Mansion Hill Historic District",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LotDepth : num 0 0 0 0 0 0 0 0 150 0 ...
## $ LotNumber : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LotteryCredit : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LotType1 : Factor w/ 5 levels "1 - Regular",..: 1 1 2 2 1 2 2 1 1 2 ...
## $ LotType2 : Factor w/ 7 levels "0 - No Exception",..: 2 1 1 1 1 1 1 2 1 1 ...
## $ LotWidth : num 0 0 0 0 0 0 0 0 93 0 ...
## $ MCDCode : Factor w/ 1 level "MADC": 1 1 1 1 1 1 1 1 1 1 ...
## $ NationalHistoricalDist: int 0 0 0 0 0 0 0 0 0 0 ...
## $ NeighborhoodDesc : Factor w/ 11 levels "Allied Drive",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ NeighborhoodPrimary : Factor w/ 17 levels "0 - No description entered",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ NeighborhoodSub : Factor w/ 3 levels "0 - No description entered",..: 1 1 1 1 1 1 1 1 1 1 ...
## [list output truncated]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 144600 211500 318629 298000 97320000
The plots and summary above show the distribution of the feature of interest, CurrentTotal. This variable represents the total assessed value of a property (land plus improvements). As the first histogram shows, the distribution is highly right-skewed, with a cluster of values around zero and a maximum value of $97,320,000. A log transformation reveals a distribution with a peak around $200,000.
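The effect of the log transformation can be sketched in base R as follows. This is a minimal illustration on synthetic values, not the code used for the plots above; the simulated data, the base-10 logarithm, and the choice to drop zero-valued parcels before transforming are all assumptions.

```r
# Synthetic stand-in for assessor_data$CurrentTotal; all values are invented.
set.seed(1)
assessed <- c(rep(0, 50), round(rlnorm(950, meanlog = log(210000), sdlog = 0.6)))

# log10(0) is -Inf, so zero-valued parcels are dropped before transforming.
positive <- assessed[assessed > 0]
log_value <- log10(positive)

# Bin the transformed values; plot = FALSE returns the counts without drawing.
h <- hist(log_value, breaks = 30, plot = FALSE)

# Midpoint of the tallest bin, converted back to dollars.
peak_dollars <- 10^h$mids[which.max(h$counts)]
peak_dollars
```

Calling `plot(h)` draws the histogram on the log scale, where the long right tail is compressed and the bulk of the values becomes visible.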
## summary(assessor_data$PropertyUse)
## Single family 46827
## Condominium 17017
## 2 Unit 3280
## Vacant 2983
## 4 unit Apartment 938
## Commercial exempt 907
## Agricultural 748
## 3 unit Apartment 577
## Condominium -other 573
## Office 2 sty or lg. 301
## Warehouse & office 248
## 8 unit Apartment 239
## Store 1 sty sm 214
## Condominium-notation 196
## Apartment & store 192
## Manufacturing 142
## Office - 1 story 134
## 6 unit Apartment 133
## 5 unit Apartment 123
## Warehouse 1 story 120
## M-1 vacant 111
## Condominium -office 95
## Shop center neighbor 88
## Restaurant 73
## Condominium -apt 72
## Condo -store/retail 69
## Pud vacant 68
## Bank, s & l 60
## C-2 parking lot 60
## Shop, 1 story sm. 60
## Tavern 50
## Gas & store 49
## Store 1 sty lg dept. 49
## C-2 vacant 48
## Condominium-Warehouse 45
## 7 unit Apartment 44
## Apartment & office 44
## C-1 vacant 44
## C-3l vacant 42
## Shop & office 41
## Other 39
## Store-warehse 1 sty. 39
## Garage, repair 38
## Rest drive-in w/seat 38
## Hotel 37
## Medical clinic 32
## Warehouse, mini type 32
## C-3 vacant 31
## Day care center 31
## Gar new car & repair 31
## Office converted sm. 31
## Rest. w/bar & liquor 31
## Frat & sorority lg. 29
## Commercial Exempt Condo 28
## Warehouse, small 28
## M-1 parking lot 27
## Restaurant & apts. 27
## Store & office small 26
## Store & shop 26
## Store 2 sty small 26
## 10 unit Apartment 24
## Rooming house 23
## Tavern & apartment 23
## 24 unit Apartment 22
## C-3 parking lot 21
## Motel 21
## 0 unit Apartment 20
## Office & retail 20
## Gar used car & fix 19
## 12 unit Apartment 18
## 16 unit Apartment 18
## 30 unit Apartment 18
## Nursing home 18
## Rpsm vacant 18
## 9 unit Apartment 17
## 13 unit Apartment 16
## 18 unit Apartment 16
## 20 unit Apartment 16
## 40 unit Apartment 16
## 72 unit Apartment 16
## Restaurant & office 16
## 14 unit Apartment 15
## Apartments & rooms 15
## 36 unit Apartment 14
## Office insur type lg 14
## Restaurant & store 14
## 11 unit Apartment 13
## 48 unit Apartment 13
## C-1 parking lot 13
## Garage, steel sm. 13
## 60 unit Apartment 12
## 64 unit Apartment 12
## C-3l parking lot 12
## Golf course 12
## Grocer, large 12
## Office medical 12
## Shop & warehouse 12
## Store, Big Box 12
## Shop & house 11
## (Other) 664
These plots explore the first explanatory feature of interest, PropertyUse. PropertyUse has 305 levels; the summary above lists the 99 most frequent individually and groups the remainder under "(Other)". I am interested in single-family homes and condominiums, which are the top two categories. The first plot, which includes all values, is difficult to read. The second, which plots only those categories with frequencies of more than 1000, visually summarizes the frequencies of the top four categories.
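Counting the levels and keeping only the frequent ones can be sketched in base R. The factor below is a toy stand-in with invented counts; only the threshold of 1000 comes from the text.

```r
# Toy factor standing in for assessor_data$PropertyUse; counts are invented.
property_use <- factor(c(rep("Single family", 1500), rep("Condominium", 1200),
                         rep("Tavern", 50), rep("Hotel", 37)))

# Total number of categories (levels), regardless of how summary() prints them.
nlevels(property_use)

# Frequency of every level, then only the levels seen more than 1000 times.
counts <- table(property_use)
common <- counts[counts > 1000]
sort(common, decreasing = TRUE)
```

Note that `summary()` on a factor shows at most 100 entries by default, so `nlevels()` is the safer way to count categories.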
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 862 1260 1267 1709 18144
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1018 1316 1360 1696 9342
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 157 1100 1375 1493 1745 9342
The histograms above show the distribution of TotalLivingArea. The full raw dataset contains a large number of zeros; these are largely eliminated after subsetting the data to single family homes and condominiums.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3267 7800 23274 11250 24549987
The LotSize feature, which corresponds to the area of the property lot in square feet, is heavily skewed right.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 3.436 3.000 720.000
The distribution of Bedrooms is also skewed right, with the interquartile range running from 2 to 3 bedrooms. Restricting the range of the histogram shows this distribution more clearly.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 1.912 2.000 569.000
The distribution of bathrooms is also highly skewed right. The interquartile range is between 1 and 2 full bathrooms.
The histograms for HalfBaths indicate that this feature lacks variability as well, except for some outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1916 1959 1576 1988 2017
The first plot of the raw data shows that there are over 15,000 observations with YearBuilt = 0. Setting the minimum value to 1837 and adding tick marks every 10 years shows the distribution of YearBuilt more clearly. Of note is the presence of housing booms and busts over time, most clearly seen in the run-up to the housing crisis of 2008, followed by the Great Recession.
## Allis Assigned Chavez
## 6115 2274 164 2781
## Crestwood Elvehjem Emerson Falk
## 3365 3193 3104 2686
## Franklin-Randall Glendale Gompers Hawthorne
## 9109 2564 2522 1726
## Huegel Kennedy Lake View Lapham-Marquette
## 2495 4089 607 5036
## Leopold Lindbergh Lowell Mendota
## 743 790 1944 1618
## Midvale-Lincoln Muir Olson Orchard Ridge
## 3636 2060 2556 1690
## Sandburg Schenk Shorewood Stephens
## 1796 2335 83 2940
## Thoreau To be determined Van Hise
## 2360 273 2368
This plot shows the number of observations by elementary school. Of note is the large number of observations that are not assigned to any school. Investigation showed that in many cases these were homes located in Madison but falling in the school districts of adjacent communities.
Initial plots and tables show that, in order to realistically analyze the effect of property features on assessed value, we need to subset the data, removing commercial and large multi-family properties. In addition, the large number of missing values for Elementary/Middle/High school needs to be filled with the “AttendanceSchool” value where possible.
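One way to fill blank school values from AttendanceSchool is sketched below. It assumes that missing schools are stored as empty strings (as the factor levels in the structure output suggest) and that the columns have been converted to character first; the rows themselves are invented.

```r
# Toy rows; column names follow the dataset, values are invented.
schools <- data.frame(
  ElementarySchool = c("Allis", "", "Chavez", ""),
  AttendanceSchool = c("Madison", "Verona", "Madison", ""),
  stringsAsFactors = FALSE
)

# Where the elementary school is blank and an attendance school is recorded,
# copy the attendance school across; otherwise keep the existing value.
blank <- schools$ElementarySchool == "" & schools$AttendanceSchool != ""
schools$ElementarySchool[blank] <- schools$AttendanceSchool[blank]
schools
```

If the columns are factors, as in the raw data, applying `as.character()` before the assignment avoids invalid-level warnings; the result can be turned back into a factor afterwards.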
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 172600 221500 250375 294700 4500000
Compared to the raw data, the plots above, which are drawn from data subsetted to include only single family and condominium properties, show a much tighter, right-skewed distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1100 1375 1492 1745 9342
In previous iterations of this plot, I subset the data to exclude TotalLivingArea = 0. This plot therefore does not show much change from plots 6 and 7 above.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 4318 7998 8001 10725 873073
Restricting the dataset to Single Family Homes/Condominiums results in a less right skewed distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 2.00 3.00 2.98 3.00 12.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 1.737 2.000 8.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4452 1.0000 4.0000
Restricting the dataset to Single Family Homes/Condominiums results in the removal of outliers such as the observation with 720 bedrooms.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1837 1952 1971 1970 1996 2017
This plot confirms that using the limits set in plot 17 results in a similar distribution for the subsetted data.
## Allis Assigned Chavez
## 1995 81 2530
## Crestwood De Forest Elvehjem
## 2894 120 2871
## Emerson Falk Franklin-Randall
## 2414 1963 4665
## Glendale Gompers Hawthorne
## 1813 1720 1234
## Huegel Kennedy Lake View
## 2163 3358 533
## Lapham-Marquette Leopold Lindbergh
## 2410 525 747
## Lowell Mc Farland Mendota
## 1431 341 1413
## Middleton/Cross Plains Midvale-Lincoln Muir
## 1703 3032 1736
## Olson Orchard Ridge Sandburg
## 1966 1532 1353
## Schenk Shorewood Stephens
## 1883 51 2422
## Sun Prairie Thoreau To be determined
## 590 1970 181
## Van Hise Verona Waunakee
## 1810 409 331
In addition to subsetting the data, I addressed the issue of missing schools, since location (as indicated by attendance school) is an explanatory feature of interest in this analysis. This plot reflects the updated value for “ElementarySchool”.
The dataset has 79,022 observations and 147 variables. It contains detailed information about the assessed properties as well as the assessed values for the current and previous years. For this analysis, I restricted the ‘PropertyUse’ variable to ‘Single family’ or ‘Condominium’.
The main features of interest are the assessed values for land and improvements, for the current and previous years: “CurrentLand”, “CurrentImpr”, “CurrentTotal”, “PreviousLand”, “PreviousImpr”, “PreviousTotal”. For the purpose of this particular analysis, I focus on “CurrentTotal”.
I think that “LotSize”, “TotalLivingArea”, “PropertyUse”, “Bedrooms”, “FullBaths”, “HalfBaths”, and “YearBuilt” will support the investigation into assessed property value. Location is also important, but I will have to investigate to see which of the following location variables best predicts assessed value: “ElementarySchool”, “MiddleSchool”, “HighSchool”, “Ward”, “StateAssemblyDistrict”, “AlderDistrict”, “CensusTract”.
Yes. To generate a smoother distribution of assessed values, I applied a log transformation to “CurrentTotal”. I also created a new variable called “HomeAge”, computed as the current year minus “YearBuilt”, plus 1 (for this assessment, the current year is 2017).
In the raw data, TotalLivingArea has a sizeable number of zeros. These appear to be parcels that correspond to parking or storage for condos. To simplify the analysis, they are dropped.
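The two derived variables can be sketched as follows. The data frame is a toy example, and both the column name `LogCurrentTotal` and the base-10 logarithm are assumptions; the text only states that a log transformation was applied.

```r
# Toy rows; column names follow the dataset, values are invented.
homes <- data.frame(YearBuilt = c(1960L, 1996L, 2017L),
                    CurrentTotal = c(186700, 250000, 410000))

# Assessment year taken from the text above.
assessment_year <- 2017L

# HomeAge counts the year of construction as year 1.
homes$HomeAge <- assessment_year - homes$YearBuilt + 1L

# Hypothetical column name; log10 is an assumed choice of base.
homes$LogCurrentTotal <- log10(homes$CurrentTotal)
homes
```

With this definition, a home built in the assessment year has a HomeAge of 1 rather than 0.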
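The subsetting described above can be sketched in base R. The rows are invented, and `keep_uses` is a hypothetical helper name; the level names match those printed in the PropertyUse summary.

```r
# Toy rows; column names follow the dataset, values are invented.
parcels <- data.frame(
  PropertyUse = c("Single family", "Condominium", "Condominium", "Tavern"),
  TotalLivingArea = c(1371L, 0L, 980L, 2400L),
  stringsAsFactors = FALSE
)

# Keep only the two residential categories of interest,
# and drop parcels with no recorded living area.
keep_uses <- c("Single family", "Condominium")
homes <- subset(parcels, PropertyUse %in% keep_uses & TotalLivingArea > 0)
homes
```

If PropertyUse is a factor, calling `droplevels()` on the result removes the unused categories from later tables and plots.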
## 'data.frame': 58190 obs. of 9 variables:
## $ CurrentTotal : int 186700 180700 178400 193900 192200 169100 179100 195500 179100 187600 ...
## $ PropertyUse : chr "Single family" "Single family" "Single family" "Single family" ...
## $ YearBuilt : int 1960 1959 1959 1962 1959 1962 1964 1965 1958 1959 ...
## $ TotalLivingArea : int 1371 1488 1290 1043 1386 1008 990 1076 1208 1603 ...
## $ Bedrooms : int 3 4 3 3 3 5 5 4 3 4 ...
## $ FullBaths : int 1 1 1 2 2 2 2 2 1 2 ...
## $ ElementarySchool: Factor w/ 36 levels "Allis","Assigned",..: 13 13 13 13 13 13 13 13 13 13 ...
## $ MiddleSchool : Factor w/ 19 levels "Assigned","Black Hawk",..: 16 16 16 16 16 16 16 16 16 16 ...
## $ HighSchool : Factor w/ 11 levels "De Forest","East",..: 5 5 5 5 5 5 5 5 5 5 ...
The ggpairs plot above summarizes the relationships between the features of interest in this dataset. In particular, we see a relatively strong positive correlation between TotalLivingArea and CurrentTotal, and to a lesser extent a correlation between the number of full baths and CurrentTotal.
The plot above shows that the mean assessed value for single family homes is slightly greater than the mean assessed value for condominiums. Moreover, the upper tail of the distribution for single family homes appears to be longer than that of condominiums.
The plot above shows that, as is to be expected, condominiums do not have values for LotSize. The bulk of the distribution of LotSize for Single Family Homes lies between 1,000 and 10,000 square feet.
As is to be expected, the plot above shows that the distribution of TotalLivingArea is wider for Single Family Homes than for condominiums.
The box plots above show that in addition to the wider distribution for Single Family Homes, condominiums have a lower median TotalLivingArea, which is intuitive.
The histogram above shows that, as one might expect, the stock of single family homes is older than that of condominiums. The few condominiums with YearBuilt values in the late 19th or early 20th century were likely historic buildings (e.g. hotels) that were recently converted to condominiums.
These box plots show that the distribution of assessed value for single family homes contains more outliers and has a tighter interquartile range. Moreover, as the histogram above indicated, the median and mean values for single family homes are greater than for condominiums.
The above box plots show that for single family homes the median and mean land value is greater than for condominiums. Interestingly the interquartile range is greater for condominiums.
The box plots above show that the mean and median values for the assessed values of single family and condominium structures are much closer. Again, we see that the interquartile range for condominiums is much greater.
The histogram above shows how the range of total living area varies by elementary school. It also indicates which elementary schools have a greater number of single family homes and condominiums.
The histogram above shows that, as one would expect, the central tendency of total living area increases with the number of bedrooms.
## [1] -0.04254596
The above plot and correlation coefficient (-0.043) show that there appears to be no linear relationship between when a house was built and its total assessed value.
## [1] 0.7423409
The plot and correlation coefficient above (0.742) indicate that there is a relatively strong positive relationship between the total living area of a home and its total assessed value.
## [1] 0.2503815
## Source: local data frame [2 x 2]
##
## PropertyUse COR
## 1 Condominium -0.01483127
## 2 Single family 0.18849254
There appears not to be a strong relationship between lot size and assessed value. However, this is likely due to the fact that many condominiums do not have lot sizes, or lot sizes of zero. Filtering out condos may reveal a stronger relationship for single family homes.
## Source: local data frame [2 x 2]
##
## PropertyUse COR
## 1 Condominium -0.01483127
## 2 Single family 0.18849254
Subsetting this to single family homes only appears to weaken the correlation between lot size and total assessed value.
The most important explanatory feature that I wanted to examine was PropertyUse. It is reasonable to believe that condominiums and single family detached homes (“SFDHs”) represent qualitatively different markets. As such, I investigated how the distribution of other features of interest varied across these groups. The most noteworthy finding was that for the vast majority of condominiums, LotSize=0. This has important implications for the inclusion of LotSize in any regression, since it strongly covaries with PropertyUse. The distribution of the main feature CurrentTotal was wider for condominiums than for SFDHs. When I looked at the distribution of the components of CurrentTotal (CurrentLand and CurrentImpr), both of these appeared to have larger spreads for condominiums than for SFDHs.
There is a relationship between the number of bedrooms and TotalLivingArea.
The strongest relationship I found was between TotalLivingArea and CurrentTotal.
## Source: local data frame [2 x 2]
##
## PropertyUse COR
## 1 Condominium 0.6849256
## 2 Single family 0.7513470
The positive correlation between total living area and total assessed value is stronger for single family homes than for condominiums.
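A per-group correlation table like the one above can be reproduced in base R with `split()` and `sapply()`. The printed output suggests the original used a dplyr pipeline, so this is an equivalent sketch rather than the original code, and the data are simulated.

```r
# Simulated data; column names follow the dataset, values are invented.
set.seed(42)
n <- 200
homes <- data.frame(
  PropertyUse = rep(c("Condominium", "Single family"), each = n),
  TotalLivingArea = runif(2 * n, 500, 3000)
)
homes$CurrentTotal <- 100 * homes$TotalLivingArea + rnorm(2 * n, sd = 50000)

# Split the rows by property type and compute Pearson's r within each group.
cor_by_use <- sapply(split(homes, homes$PropertyUse),
                     function(d) cor(d$TotalLivingArea, d$CurrentTotal))
cor_by_use
```

Replacing `homes$PropertyUse` with another grouping column, such as a school variable, gives the corresponding by-school table.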
## Source: local data frame [36 x 2]
##
## ElementarySchool COR
## 1 Allis 0.6226428
## 2 Assigned 0.5649764
## 3 Chavez 0.6978508
## 4 Crestwood 0.7568220
## 5 De Forest 0.2401669
## 6 Elvehjem 0.7539876
## 7 Emerson 0.6556453
## 8 Falk 0.8384986
## 9 Franklin-Randall 0.8346005
## 10 Glendale 0.7017174
## .. ... ...
The correlations between living area and assessed value vary considerably by elementary school. The general trend appears to be that the more desirable the location (as determined by elementary school), the higher the correlation between living area and assessed value. The exclusive enclave of Shorewood has the highest correlation.
## Source: local data frame [19 x 2]
##
## MiddleSchool COR
## 1 Assigned 0.5649764
## 2 Black Hawk 0.6918834
## 3 Cherokee 0.7674481
## 4 De Forest 0.2401669
## 5 Hamilton 0.7927054
## 6 Jefferson 0.7590353
## 7 Mc Farland 0.6117191
## 8 Middleton/Cross Plains 0.8547688
## 9 O'Keeffe 0.8107192
## 10 Opt Cherokee/Hamiltn 0.6970512
## 11 Opt Toki/Jefferson 0.8301759
## 12 Sennett 0.6598879
## 13 Sherman 0.6917482
## 14 Sun Prairie -0.1236515
## 15 To be determined 0.3007001
## 16 Toki 0.7939027
## 17 Verona 0.6482604
## 18 Waunakee 0.8270424
## 19 Whitehorse 0.6160075
The correlations for middle school areas are necessarily less extreme in range, as we aggregate up from elementary schools.
## Source: local data frame [11 x 2]
##
## HighSchool COR
## 1 De Forest 0.2401669
## 2 East 0.7064427
## 3 Lafollette 0.6377896
## 4 Mc Farland 0.6117191
## 5 Memorial 0.7612938
## 6 Middleton/Cross Plains 0.8547688
## 7 Optional 0.6428767
## 8 Sun Prairie -0.1236515
## 9 Verona 0.6482604
## 10 Waunakee 0.8270424
## 11 West 0.7683711
The highest level of aggregation shows that there are still distinct differences across broad parts of the city of Madison and surrounding towns, despite the fact that the overall distribution is compressed.
The plots show the relationship between living area and assessed value with outliers (below 1st percentile and above the 99th percentile) removed.
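Percentile-based trimming of this kind can be sketched with `quantile()`. The text does not say which variable the percentiles were computed on, so trimming on CurrentTotal is an assumption here, and the data are simulated.

```r
# Simulated data; column names follow the dataset, values are invented.
set.seed(7)
homes <- data.frame(TotalLivingArea = round(runif(1000, 400, 4000)))
homes$CurrentTotal <- round(120 * homes$TotalLivingArea + rnorm(1000, sd = 40000))

# 1st and 99th percentiles of assessed value (assumed trimming variable).
bounds <- quantile(homes$CurrentTotal, probs = c(0.01, 0.99))

# Keep only the rows that fall between the two percentiles.
trimmed <- homes[homes$CurrentTotal >= bounds[1] &
                 homes$CurrentTotal <= bounds[2], ]
nrow(trimmed)
```

Trimming affects only what is displayed or analyzed in that step; the removed rows remain in `homes`.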
The plot above focuses on the relationship between total living area and assessed value for Madison East High School only.
The plot above focuses on the relationship between total living area and assessed value for Madison West High School only.
The plot above focuses on the relationship between total living area and assessed value for Madison Lafollette High School only.
The plot above focuses on the relationship between total living area and assessed value for Madison Memorial High School only.
The plot above focuses on the relationship between total living area and assessed value for Middleton/Cross Plains High School only.
The plot above focuses on the relationship between total living area and assessed value for condominiums only.
The plot above focuses on the relationship between total living area and assessed value for Single family homes only.
I was particularly interested in the relationship between house size, location, and assessed value. By adding a trend line to the scatter plots, I was able to find that different neighborhoods had different slopes, indicating that depending on where a house was located, the relationship between size and assessed value was stronger or weaker.
### Were there any interesting or surprising interactions between features?
In addition to the fact that different neighborhoods have different relationships between house size and assessed value, different neighborhoods also showed varying degrees of spread in the data. In other words, the variation of assessed value conditional on house size was greater for some neighborhoods than for others.
### OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.
The plot above breaks out the relationship between total living area and assessed value by the number of bedrooms. We see the general increase in total living area corresponds to an increasing number of bedrooms, as well as a much steeper relationship between total living area and assessed value for one bedroom homes.
This plot shows a clear difference by high school in the assessed value, conditional on total living area.
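The idea of a different trend-line slope per location can be made concrete by fitting a separate straight line within each group. The sketch below is illustrative only and is not a model from this analysis: the data are simulated, and the two slopes (110 and 160 dollars per square foot) are invented.

```r
# Simulated data; column names follow the dataset, values are invented.
set.seed(3)
n <- 150
homes <- data.frame(
  HighSchool = rep(c("East", "West"), each = n),
  TotalLivingArea = runif(2 * n, 600, 3500)
)

# Invented per-school slopes used only to generate the toy data.
true_slope <- ifelse(homes$HighSchool == "West", 160, 110)
homes$CurrentTotal <- 50000 + true_slope * homes$TotalLivingArea +
  rnorm(2 * n, sd = 30000)

# Fit CurrentTotal ~ TotalLivingArea separately per school and keep the slope.
slopes <- sapply(split(homes, homes$HighSchool), function(d) {
  coef(lm(CurrentTotal ~ TotalLivingArea, data = d))[["TotalLivingArea"]]
})
slopes
```

Each slope estimates the change in assessed value per additional square foot within that group, which is the quantity the trend lines in the scatter plots display.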
The distribution for Condominiums is bifurcated in a way that you don’t see for single family detached homes. This reflects the changing market for condos, where we see new luxury units at the high end of the value distribution, lower value units, but not as much in the middle of the distribution. Overlaying the scatter plots shows this clearly.
I found that cleaning and subsetting the data was the most challenging aspect of this project. After an initial exploration, it was clear that the dataset included property types that I was not interested in examining, as well as wrinkles such as the fact that school values were missing for properties that were not in the Madison school district. Further work could focus on developing an explicit model for assessed value based on the variables above.